Abstract

In this report, we unify two quite distinct approaches to information retrieval: region
models and language models. Region models were developed for structured document
retrieval. They provide a well-defined behaviour as well as a simple query language that
allows application developers to rapidly develop applications. Language models are
particularly useful to reason about the ranking of search results, and for developing new
ranking approaches. The unified model allows application developers to define complex
language modeling approaches as logical queries on a textual database. We show a
remarkable one-to-one relationship between region queries and the language models they
represent for a wide variety of applications: simple ad-hoc search, cross-language retrieval,
video retrieval, and web search.

1 Introduction

The introduction of the relational model by Codd in 1970 [13] marks one of the success stories
of computer science. The relational model laid the path for the development of relational
database systems: general software tools for management of data with a well-understood
and well-defined behaviour. They allow application developers to rapidly develop application
programs that are easy to understand, document and teach [16]. Indeed, saying "databases"
is saying "relational": Virtually any introductory book or course on databases will teach the
basics of the relational data model and SQL.

It can be argued that information retrieval is still at the stage where databases were in the
1960s. There is no such thing as an equivalent of the relational model for information retrieval
systems. Introductory books and courses on information retrieval OHS] will teach the student
several information retrieval models - mostly focusing on different ranking strategies - each
with its own strengths and weaknesses. Developing a retrieval application or deploying a
search engine requires applications to call non-standard application program interfaces (APIs)
and use non-standard query languages.

As an example, the Terrier system, a research information retrieval system developed by
the University of Glasgow [H], is based on the so-called divergence from randomness models
[1]. Terrier provides APIs for indexing and querying. To use the Terrier indexing API on a
non-standard collection (Terrier comes with some fully implemented APIs, for instance for
HTML documents), the application developer needs to create an object which implements the
collection interface. This object finds all the files it has to process, and opens each one to
create a document object which identifies which tags (or other byte sequences) act as document
delimiters. Application programs that work with this setup will be logically impaired if the
file locations or document format (for instance the XML DTD) need to be changed. Or,
in analogy with Codd's [13] analysis of the database systems from the 1960s: The retrieval
system does not provide access path independence.

As another example, the Lemur toolkit [40] is a research retrieval system that is
specifically designed to support research in language modeling [JH [361 US]. The toolkit supports
a broad range of different applications of information retrieval such as ad hoc retrieval,
distributed retrieval, cross-language retrieval, etc. Lemur supports at least four different index
types, each supporting different kinds of queries. For instance, some indexes include word
positions to allow proximity queries, whereas others only allow very basic functionality.
Application programs that work with one kind of index might be logically impaired if the index
type is changed. In analogy with Codd [13], the retrieval system does not provide indexing
independence.^1

In the past, we have used systems like Terrier and Lemur to research new applications of 
information retrieval technology such as cross-language retrieval [23], web retrieval [28], and
video shot retrieval [25]. To develop such retrieval approaches, it was necessary to reimplement
parts of the existing system: reimplementing APIs, introducing new APIs, introducing new 
query languages, and even introducing new indexing and storage structures. In this report, 
we present a framework that supports all such approaches by means of a simple yet powerful 
query language (similar to SQL or relational algebra) that hides the implementation details 
of retrieval approaches from the application developer. As such, the system provides access 
path independence and indexing independence. 

There have been other attempts to develop approaches to information retrieval that
provide data independence. For instance, Schek [lU] describes methods for integrating databases
and information retrieval systems where application programs and queries are not aware of
access paths and indexes. Fuhr [19] describes a layered system design for information
retrieval systems following the ANSI/SPARC model [55], distinguishing a physical (internal)
layer, a conceptual layer and an external layer. The system might process queries in several
ways, such as directly by an index, or by using an index as a filter with an additional scan of
the filtered results. Probabilistic relational algebra or probabilistic Datalog (see [18] for an
overview) might serve as conceptual query languages in such systems. An example of a system
that implements this approach is HySpirit [21]. In this report we introduce an alternative
for probabilistic relational algebra and probabilistic Datalog that is much closer to existing
models of information retrieval.

1.1 Region models 

Motivated by the data independence issues described above, Burkowski [11] proposes a
mathematical framework which he called the containment model that operates on sets of
contiguous extents. We will call extents regions in this report, and the model region model.
A region might be a word, a phrase, a text element such as a title, or a complete document.
Burkowski's model comes with a small number of basic operators on sets of regions, the
most important ones being SN (select narrow) and SW (select wide). A search for chapters
containing the word "databases" would be expressed as <chapter> SW databases, and if the
application program only needs to put the chapter's title on the screen, the query would be
<chapter_title> SN (<chapter> SW databases). In Burkowski's framework, the application
program does not know how a text collection and its index facilities are managed. The
complexity of the retrieval system is encapsulated in a module that only responds to simple
command strings like the ones above. Similar frameworks are introduced by Salminen and
Tompa [47], Clarke et al. [12], Baeza-Yates and Navarro [1], Consens and Milo [14], and
Jaakkola and Kilpeläinen [26]. We will call the models underlying these approaches region
models in this report.

^1 Codd identified one more type of data independence: ordering independence. As textual data
is inherently ordered, we are not concerned with ordering independence.

Unlike Codd's relational model for databases, the region models above did not have a
big impact on the information retrieval research community, nor on the development of new
retrieval systems. The reason for this is quite obvious: region models do not explain in
any way how search results should be ranked. In fact, most region models are not concerned
with ranking at all; one might say they - like the relational model - are actually data models
instead of information retrieval models. Region model approaches that do address ranking,
like Burkowski's model [11] and the approach by Masuda et al. [32], only include it as an
afterthought: Retrieve first, then rank with some standard retrieval model such as a vector
space model using tf.idf weights [l8].

1.2 Language models 

If anything, an approach to information retrieval has to address the ranking of search results.
Ranking is the single most important feature of a search engine, and information retrieval
modeling almost exclusively focuses on ranking (see e.g. O Chapter 2]). Traditionally,
developing ranking strategies involves engineering, fitting and tuning term weighting approaches
to improve experimental results [48], although there are some notable exceptions, for instance
the probabilistic model by Robertson and Sparck-Jones ^Bj. A more recent approach that
does not require lots of fitting and tuning is the use of statistical language models for
information retrieval [211 [SSI US]. Language models assign a probability to a piece of text.
They are built for each document: Each document model assigns a probability to a text query,
and documents are ranked accordingly. Language models have been applied to a wide variety of
retrieval problems, such as simple ad-hoc search [2H [271 IM], cross-language retrieval
[71 [231 [33 [56], video retrieval using speech transcripts [151 [IS], and web search
[271 IMl [39]. Examples of these applications will be shown in Section 3.

1.3 Unifying region models and language models 

In this report we introduce an approach to information retrieval that fully integrates region 
models and language models. The approach allows application developers to define complex 
language modeling approaches as logical region queries on a textual database. We show a 
remarkable one-to-one relation between region queries and the language models they represent 
for the four retrieval problems mentioned above: ad-hoc search, cross-language retrieval, video 
retrieval, and web search. The report is organised as follows. In Section 2 we introduce the
combined region/language model. Section 3 illustrates the application of the model by relating
probability measures to region queries. Finally, in Section 4 we present future work and relate
the approach to current work on XML query languages and XML database systems.



2 A region model for text databases and a query language 

This section briefly introduces the unified region/language model. The definitions closely 
follow Burkowski's model [11], which we extend with region scores similar to the score region 
algebra we used for XML information retrieval [31].

A textual database consists of a finite sequence of words $w_1, w_2, \ldots, w_{n-1}$, where $w_i$
is used to denote the word on position i in the database. Additionally, the textual database
consists of a hierarchy of text elements. Both words and elements are identified by the 
word positions in the database. Text elements are sequences of words that have a particular 
significance in the database. For example, a database with recipes will have text elements 
"ingredients", "quantities", "instructions", etc., typically marked up as XML. 

A scored region r is defined by two integers r.start and r.end (1 ≤ r.start < r.end ≤ n),
and a float r.score (r.score > 0).^2 The integers start and end represent respectively the
position of the first word that belongs to the contiguous region, and the position directly
following the last word that belongs to the region. A region might be a text element, but also
any other contiguous sequence of words. Note that the region (i, i + 1, s) includes one (and
only one) word $w_i$ with a score s.

Retrieval from the textual database is done with a simple query language consisting of 
words, elements and five basic operators: CONTAINING, CONTAINED_BY, SCALE, AND, and 
OR. The language defines an algebra on sets of scored regions. Unlike Burkowski's model
[11], there are no additional constraints on sets of regions. We will now define the language
primitives one by one in a rather informal way. For convenience, Figure 1 contains a more
formal definition of the operators using SQL.

A word. A single word, for example the query banana, produces a set of regions R, where
each region r ∈ R defines a position of the word in the textual database; r.start being
the position on which the word occurs, r.end = r.start + 1, and r.score = 1.

An element. A single element, for instance the query <recipe>, produces a set of regions
R, where each region r ∈ R is tagged as "recipe", r.start being the position of the first
word of the XML element, r.end being the position following the last word of the XML
element, and r.score = 1.

R1 CONTAINING R2. The operator CONTAINING takes two sets of regions R1 and R2, and
produces the subset of regions from R1 that contain at least one region from R2. For
instance, the query <recipe> CONTAINING banana produces all regions tagged as "recipe"
that contain at least one occurrence of "banana". Inspired by language models, each
"recipe" region is scored by the number of occurrences of "banana" in the region, divided
by the length of the region (measured as r.end − r.start). Occurrences of "banana" are
weighted by their length and by their score (of course, in the example query both length
= 1 as well as score = 1); see Figure 1.

R1 CONTAINED_BY R2. The operator CONTAINED_BY takes two sets of regions R1 and R2,
and produces the subset of regions from R1 that are contained by at least one region from
R2. For instance, the query <ingredient> CONTAINED_BY <recipe> produces all ingredients
that belong to at least one recipe. If a region from the left-hand side of the expression
is nested in more than one region from the right-hand side of the expression, then the
scores of those regions are added. This will be used in the next section to express the
linear combination of several language models; see Figure 1.

^2 We intentionally use a notation that is close to that of the relational data model; see
also Figure 1.

f SCALE R. The operator SCALE takes a float f and a set of regions R and produces all
regions from R where each region r ∈ R is scored as f · r.score. For instance, the query
0.2 SCALE banana produces the set of regions with the positions of the word "banana",
all with a region score of 0.2; see Figure 1.

R1 AND R2. The operator AND takes two sets of regions R1 and R2, and produces only those
regions that are both in R1 and R2, i.e., the intersection of both sets when ignoring the
region scores. Each region in the result is scored by multiplying its scores in R1 and
R2. For instance, the query (<recipe> CONTAINING banana) AND (<recipe> CONTAINING
apple) produces all regions tagged as "recipe" that contain both the word "banana"
and the word "apple", scored by the product of the scores of the respective regions; see
Figure 1.

R1 OR R2. The operator OR takes two sets of regions R1 and R2, and produces those regions
that are either in R1 or in R2, i.e., the union of both sets when ignoring the region
scores. For instance, the query (<recipe> CONTAINING sugar) OR (<recipe> CONTAINING
sweet) produces all regions tagged as "recipe" that contain either the word "sugar"
or the word "sweet" (or both). Regions keep their score, unless both sets contain the
region, in which case the region is scored by adding its scores in R1 and R2; see Figure 1.



-- R1 CONTAINING R2
SELECT R1.start, R1.end, R1.score * SUM((R2.score *
       (R2.end - R2.start)) / (R1.end - R1.start)) AS score
FROM R1, R2
WHERE R1.start <= R2.start AND R1.end >= R2.end
GROUP BY R1.start, R1.end, R1.score

-- R1 CONTAINED_BY R2
SELECT R1.start, R1.end, R1.score * SUM(R2.score) AS score
FROM R1, R2
WHERE R1.start >= R2.start AND R1.end <= R2.end
GROUP BY R1.start, R1.end, R1.score

-- f SCALE R
SELECT R.start, R.end, f * R.score AS score
FROM R

-- R1 AND R2
SELECT R1.start, R1.end, R1.score * R2.score AS score
FROM R1, R2
WHERE R1.start = R2.start AND R1.end = R2.end

-- R1 OR R2
SELECT R.start, R.end, SUM(R.score) AS score
FROM (SELECT * FROM R1 UNION ALL SELECT * FROM R2) AS R
GROUP BY R.start, R.end



Figure 1: Definition of operators in SQL. 



Figure 1 contains a definition of the operators using SQL, as a pragmatic means to provide
a formal definition of the region algebra operators without the need to get into specific
mathematical notations. So, we show SQL definitions here for convenience, as we assume most
readers are familiar with SQL. The definitions do not suggest in any way that the system
should be implemented on top of a relational database system. We implemented the system
- without the use of SQL - on top of MonetDB [31], but it might as well be implemented
using traditional inverted file indexes on the file system.^3
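The operator definitions of Figure 1 can also be read procedurally. The following Python sketch
is one possible in-memory reading and not the implementation described in this report. It
assumes that a region set is a dictionary from (start, end) to score, so that a set holds at
most one score per extent; the function names and the toy recipe collection are hypothetical
and chosen for illustration only.

```python
# A region set is modelled as {(start, end): score}; end is exclusive, as in the text.

def containing(r1, r2):
    """R1 CONTAINING R2: regions of R1 that enclose at least one region of R2."""
    result = {}
    for (s1, e1), score1 in r1.items():
        weights = [score2 * (e2 - s2)
                   for (s2, e2), score2 in r2.items()
                   if s1 <= s2 and e2 <= e1]
        if weights:  # like the SQL join, regions without a match are dropped
            result[(s1, e1)] = score1 * sum(weights) / (e1 - s1)
    return result

def contained_by(r1, r2):
    """R1 CONTAINED_BY R2: regions of R1 nested in at least one region of R2."""
    result = {}
    for (s1, e1), score1 in r1.items():
        outer = [score2 for (s2, e2), score2 in r2.items()
                 if s2 <= s1 and e1 <= e2]
        if outer:
            result[(s1, e1)] = score1 * sum(outer)
    return result

def scale(f, r):
    """f SCALE R: multiply every score by f."""
    return {extent: f * score for extent, score in r.items()}

def and_(r1, r2):
    """R1 AND R2: extents present in both sets, scores multiplied."""
    return {extent: r1[extent] * r2[extent] for extent in r1.keys() & r2.keys()}

def or_(r1, r2):
    """R1 OR R2: extents present in either set, scores added when in both."""
    result = dict(r1)
    for extent, score in r2.items():
        result[extent] = result.get(extent, 0.0) + score
    return result

# Toy database (an assumption): words at positions 1..6 and two <recipe> elements.
tokens = ["banana", "bread", "banana", "flour", "apple", "pie"]
recipe = {(1, 5): 1.0, (5, 7): 1.0}

def word(term):
    """A word query: one unit-score region per occurrence of the term."""
    return {(i, i + 1): 1.0 for i, t in enumerate(tokens, start=1) if t == term}

banana_recipes = containing(recipe, word("banana"))
apple_recipes = containing(recipe, word("apple"))
either = or_(banana_recipes, apple_recipes)
both = and_(banana_recipes, apple_recipes)
print(banana_recipes)  # the first recipe, scored 2 occurrences / length 4
print(either, both)
```

In this toy collection no recipe mentions both words, so the AND query returns an empty set,
whereas the OR query returns both recipes with their individual scores.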

A natural application of the region model is to support structured queries in an XML
information retrieval system. The following query is an example XML information retrieval
query formulated in NEXI. NEXI [51] stands for narrowed extended XPath, a query language
that restricts XPath ^ by only allowing descendant axis steps, and that extends XPath by
a special about operator that ranks the selected nodes by their estimated relevance to the 
query. NEXI is used to evaluate XML retrieval systems in the Initiative for the Evaluation 
of XML retrieval (INEX) [2D]. Suppose we want to retrieve sections about "databases" from 
articles that mention "book review" in either the article title (atl) or the keywords (kwd): 

//article[about(.//(atl|kwd), book review)]//sec[about(., databases)]

This can be formulated as follows as a region query: 

(<sec> CONTAINING databases) CONTAINED_BY (<article> CONTAINING 
(((<atl> OR <kwd>) CONTAINING book) CONTAINING review)) 



This approach is followed with success in INEX by the TIJAH system [31] [35]. The expression
defines a ranking of the selected nodes. Rewriting the NEXI query to the region expression
is not trivial, but relatively easy: TIJAH has a NEXI to region query parser.

In the next section we show the relationship between language modeling ranking definitions 
and region queries, similar to the relationship between NEXI queries and the region queries. 



3 Logical queries for complex retrieval tasks 

3.1 The simplest unigram language model 

As said in the introduction, language models form a general approach to define ranking 
formulas for retrieval applications. A language model is assigned to every document. The 
language model of the document defines the probability that the document 'generates' the 
query. Documents are ranked by this probability. The simplest language modeling approach 
to information retrieval would be defined by Equation 1:

$$P(T_1, T_2, \ldots, T_l \mid D) = \prod_{i=1}^{l} P(T_i \mid D) \tag{1}$$

It defines the probability of a query of length l given a document D as the product of the
probabilities of each term $T_i$ ($1 \le i \le l$) given D. A language model that takes a simple
product of terms, i.e., a model that assumes that the probability of one term given a document
does not depend on other terms, is called a unigram language model.

^3 For readers that do want to implement this on top of a relational DBMS, please note that
'R1.end' clashes with the SQL reserved word 'END' in practical systems.

To make this work, we have to define the basic probability measure $P(T \mid D)$; typically, it
would be defined as the number of occurrences of the term T in the document D, divided by the
total number of terms in the document D. For a practical query, say, retrieve all documents
about "db" and "ir", we would instantiate Equation 1 as follows:

$$P(T_1 = \mathrm{db}, T_2 = \mathrm{ir} \mid D) = P(T_1 = \mathrm{db} \mid D) \cdot P(T_2 = \mathrm{ir} \mid D) \tag{2}$$

The right-hand side of the equation corresponds to the following region expression. 

(<doc> CONTAINING db) AND (<doc> CONTAINING ir) (3) 

This can be shown as follows: The region expression (<doc> CONTAINING db) produces all
documents ranked according to $P(T = \mathrm{db} \mid D)$, i.e., all regions tagged as <doc>, scored
by the number of occurrences of db in those regions divided by the length of the region.
Similarly, (<doc> CONTAINING ir) produces all documents ranked according to
$P(T = \mathrm{ir} \mid D)$. Finally, the operator AND results in the regions
tagged as <doc> that are in both operand sets. The score of the result regions is defined as
the product of the scores of the same regions in the operands. Here, and in the remaining
examples in this section, we assume that <doc> regions do not nest inside each other.

We claim that there is a trivial way to rewrite the right-hand side of Equation 2 to
Equation 3 while preserving the outcome. This can be shown by simply replacing $P(x \mid y)$ by
(y CONTAINING x), and the multiplication in Equation 2 by AND. Regions that are assigned zero
probability by the probability measure of Equation 2 are not retrieved by the region expression
of Equation 3. So, the region expression selects all y for which $P(x \mid y) > 0$. If the
probability measure assigns zero probability to a region then this implies that the
corresponding region expression will not retrieve it; and, if a region is not retrieved by a
region expression then this implies that its corresponding probability function assigns zero
probability to it.
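The correspondence between Equation 2 and Equation 3 can be checked on a small example. The
sketch below is a minimal illustration under our own assumptions: region sets are dictionaries
from (start, end) to score, the two operators follow Figure 1, and the two toy documents and
all function names are hypothetical.

```python
def containing(r1, r2):
    """R1 CONTAINING R2, following Figure 1."""
    out = {}
    for (s1, e1), sc1 in r1.items():
        hits = [sc2 * (e2 - s2) for (s2, e2), sc2 in r2.items()
                if s1 <= s2 and e2 <= e1]
        if hits:
            out[(s1, e1)] = sc1 * sum(hits) / (e1 - s1)
    return out

def and_(r1, r2):
    """R1 AND R2, following Figure 1."""
    return {k: r1[k] * r2[k] for k in r1.keys() & r2.keys()}

# Toy collection (an assumption): the second document does not contain "ir".
docs = [["db", "ir", "db", "xml"], ["db", "sql"]]
tokens = [t for d in docs for t in d]

# Lay the documents out on one position axis (1-based, end exclusive).
extents, start = [], 1
for d in docs:
    extents.append((start, start + len(d)))
    start += len(d)
doc_regions = {extent: 1.0 for extent in extents}

def word(term):
    return {(i, i + 1): 1.0 for i, t in enumerate(tokens, start=1) if t == term}

# Region expression (3).
ranked = and_(containing(doc_regions, word("db")),
              containing(doc_regions, word("ir")))

# Equation (2), computed directly from term counts.
def p(term, d):
    return d.count(term) / len(d)

direct = {}
for extent, d in zip(extents, docs):
    prob = p("db", d) * p("ir", d)
    if prob > 0:  # zero-probability documents are not retrieved
        direct[extent] = prob

print(ranked)
print(direct)
```

Both computations return only the first document, with the same score; the second document has
zero probability under Equation 2 and is not retrieved by the region expression.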

3.2 Linear interpolation smoothing 

The simple language model presented in the previous section assigns zero probability to a 
document unless it contains all query terms. So, if none of the documents contains all terms, 
the system does not retrieve anything. This behaviour will be appropriate for many practical 
applications. In fact, it is the default behaviour of web search engines like Google and Yahoo. 
For other applications, it might be undesirable to have empty results. When searching 
collections that are significantly smaller than the web, it is likely that precise queries will 
not retrieve anything. In practice, language modeling approaches therefore use a technique 
called "smoothing", i.e., some probability mass is assigned to terms that do not occur in 
the document. The standard language modeling approach uses a mixture of the document 
model $P(T_i \mid D)$ with a general collection model $P(T_i \mid C)$ [3 [H ES El EH ED], called
linear interpolation smoothing.

$$P(T_1, T_2, \ldots, T_l \mid D) = \prod_{i=1}^{l} \big( (1 - \lambda) P(T_i \mid C) + \lambda P(T_i \mid D) \big) \tag{4}$$

The document model $P(T_i \mid D)$ assigns zero probability to terms that do not occur in the
document D, but the collection model $P(T_i \mid C)$ assigns some probability to any term that
occurs somewhere in the collection. The collection model probabilities are defined similarly to
the document model probabilities as: The number of occurrences of the term T in the total
collection C, divided by the total number of terms in the collection C. The approach needs
a parameter $\lambda$ ($0 < \lambda < 1$) which is set empirically.

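For our example query, we need some value for $\lambda$ to instantiate Equation 4. Suppose we
decide $\lambda = 0.8$, then we would rank documents according to:

$$\begin{aligned}
P(T_1 = \mathrm{db}, T_2 = \mathrm{ir} \mid D) = {} & \big(0.2 \cdot P(T_1 = \mathrm{db} \mid C) + 0.8 \cdot P(T_1 = \mathrm{db} \mid D)\big) \\
& \cdot \big(0.2 \cdot P(T_2 = \mathrm{ir} \mid C) + 0.8 \cdot P(T_2 = \mathrm{ir} \mid D)\big)
\end{aligned} \tag{5}$$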

The equation corresponds to the following region expression, where the text element <root> 
corresponds to the collection root, i.e., the whole database. 

(<doc> CONTAINED_BY
  ((0.2 SCALE (<root> CONTAINING db)) OR (0.8 SCALE (<doc> CONTAINING db))))
AND                                                                              (6)
(<doc> CONTAINED_BY
  ((0.2 SCALE (<root> CONTAINING ir)) OR (0.8 SCALE (<doc> CONTAINING ir))))

This can be shown as follows: The region expression (<root> CONTAINING db) results in a set
with the single region <root>, with a score equal to the number of occurrences of db in <root>
divided by the length of <root>, i.e., $P(T = \mathrm{db} \mid C)$. The SCALE operator multiplies
the score of this region by 0.2, and the OR takes the union of this region with all document
regions (with scores $P(T = \mathrm{db} \mid D)$ as in the previous section), multiplied by 0.8 by
the SCALE operator. Note that the OR operator will not actually add
$0.2 \cdot P(T = \mathrm{db} \mid C)$ to $0.8 \cdot P(T = \mathrm{db} \mid D)$. This is done by the
CONTAINED_BY operator, which every document region on the left-hand side matches (because every
document region is contained by the collection root). Document regions that are in the set
0.8 SCALE (<doc> CONTAINING db) will get as their final score
$0.2 \cdot P(T = \mathrm{db} \mid C) + 0.8 \cdot P(T = \mathrm{db} \mid D)$; the others
will get $0.2 \cdot P(T = \mathrm{db} \mid C)$. The same line of reasoning can be followed for the
part with the term ir. Finally, the AND operator combines both parts of the query as in the
previous section.

Again, we claim there is a trivial way to rewrite the right-hand side of Equation 5 to
Equation 6. This can be shown by simply replacing $P(x \mid y)$ by (y CONTAINING x), the
multiplication operator '·' by AND if both operands are regions, or by SCALE if the first
operand is a number; the addition operator '+' by OR, and by putting "z CONTAINED_BY" in front
of the expression, where z defines the elements that need to be retrieved.

It might be argued that this very last step - "putting CONTAINED_BY in front" - is not a
trivial step, and we did not use it in the previous section. However, we might as well use it in
the previous section: It is easy to show that (<doc> CONTAINING db) AND (<doc> CONTAINING ir)
produces the same regions, with the exact same scores as (<doc> CONTAINED_BY (<doc> CONTAINING
db)) AND (<doc> CONTAINED_BY (<doc> CONTAINING ir)), because the elements on the left-hand
side of both CONTAINED_BY operators all have unit score, and because elements on the
left-hand side are nested in at most one region from the right-hand side of the CONTAINED_BY
operator. So, the general procedure that rewrites probability measures to region expressions
should use the CONTAINED_BY operator for every query term. Equivalences between region
expressions will be addressed briefly in Section 4.1.
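The argument above can be replayed numerically. The following sketch evaluates the region
expression of Equation 6 and compares it with a direct computation of Equation 5. It is an
illustration under our own assumptions: dictionary-based region sets, operators that follow
Figure 1, and a hypothetical toy collection of two documents.

```python
def containing(r1, r2):
    out = {}
    for (s1, e1), sc1 in r1.items():
        hits = [sc2 * (e2 - s2) for (s2, e2), sc2 in r2.items()
                if s1 <= s2 and e2 <= e1]
        if hits:
            out[(s1, e1)] = sc1 * sum(hits) / (e1 - s1)
    return out

def contained_by(r1, r2):
    out = {}
    for (s1, e1), sc1 in r1.items():
        outer = [sc2 for (s2, e2), sc2 in r2.items() if s2 <= s1 and e1 <= e2]
        if outer:
            out[(s1, e1)] = sc1 * sum(outer)
    return out

def scale(f, r):
    return {k: f * v for k, v in r.items()}

def and_(r1, r2):
    return {k: r1[k] * r2[k] for k in r1.keys() & r2.keys()}

def or_(r1, r2):
    out = dict(r1)
    for k, v in r2.items():
        out[k] = out.get(k, 0.0) + v
    return out

# Toy collection (an assumption); only the first document contains "ir".
docs = [["db", "ir", "db", "xml"], ["db", "sql", "rdf", "owl"]]
tokens = [t for d in docs for t in d]
extents, start = [], 1
for d in docs:
    extents.append((start, start + len(d)))
    start += len(d)
doc_regions = {extent: 1.0 for extent in extents}
root = {(1, start): 1.0}  # <root> spans the whole database

def word(term):
    return {(i, i + 1): 1.0 for i, t in enumerate(tokens, start=1) if t == term}

def smoothed(term, lam):
    """<doc> CONTAINED_BY (((1-lam) SCALE (<root> CONTAINING term))
                           OR (lam SCALE (<doc> CONTAINING term)))"""
    mixture = or_(scale(1 - lam, containing(root, word(term))),
                  scale(lam, containing(doc_regions, word(term))))
    return contained_by(doc_regions, mixture)

# Region expression (6) with lambda = 0.8.
ranked = and_(smoothed("db", 0.8), smoothed("ir", 0.8))

def direct(d, lam):
    """Equation (5), computed from term counts."""
    score = 1.0
    for term in ("db", "ir"):
        p_c = tokens.count(term) / len(tokens)
        p_d = d.count(term) / len(d)
        score *= (1 - lam) * p_c + lam * p_d
    return score

for extent, d in zip(extents, docs):
    print(extent, ranked[extent], direct(d, 0.8))
```

Unlike the unsmoothed query of Section 3.1, both documents are now retrieved: the second
document does not contain "ir", but it still receives the collection model's share of the
probability mass.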

3.3 Video shot retrieval using speech transcripts 

Now that we have shown linear interpolation smoothing, it is easy to generalise this to any
linear combination of language models. Such models have been quite successful in spoken document
retrieval for retrieving video shots [151 [25], where videos are modeled as sequences of scenes,
each consisting of sequences of shots. The language model mixes four different levels of the
video hierarchy: shots, scenes, complete videos and the total collection as:

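$$P(T_1, T_2, \ldots, T_l \mid \mathit{Shot}) = \prod_{i=1}^{l} \big( \alpha P(T_i \mid C) + \beta P(T_i \mid \mathit{Video}) + \gamma P(T_i \mid \mathit{Scene}) + \delta P(T_i \mid \mathit{Shot}) \big) \tag{7}$$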

where $\alpha + \beta + \gamma + \delta = 1$. The main idea behind this approach is that a good
shot contains the query terms, and is part of a scene that contains the query terms, which is
part of a video that contains even more of the query terms. Suppose we are looking for the exact
shots in a collection of videos where a knight says "ni"^4 and we take $\alpha = 0.18$,
$\beta = 0.02$, $\gamma = 0.4$, and $\delta = 0.4$, then the shots would be ranked according to:

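$$\begin{aligned}
P(T = \mathrm{ni} \mid \mathit{Shot}) = {} & 0.18 \cdot P(T = \mathrm{ni} \mid C) + 0.02 \cdot P(T = \mathrm{ni} \mid \mathit{Video}) \\
& + 0.4 \cdot P(T = \mathrm{ni} \mid \mathit{Scene}) + 0.4 \cdot P(T = \mathrm{ni} \mid \mathit{Shot})
\end{aligned} \tag{8}$$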

which corresponds to the following region expression. 

<shot> CONTAINED_BY
  ((0.18 SCALE (<root> CONTAINING ni)) OR (0.02 SCALE (<video> CONTAINING ni))      (9)
   OR (0.4 SCALE (<scene> CONTAINING ni)) OR (0.4 SCALE (<shot> CONTAINING ni)))

Showing that the region expression of Equation 9 retrieves and ranks video shots according
to Equation 8 is done as in the previous section.
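As a sanity check of this claim, the sketch below evaluates the region expression of
Equation 9 on a hand-made hierarchy and compares it with Equation 8 computed directly. The
hierarchy (two videos, three scenes, four shots), the transcript tokens and all function names
are assumptions made for this example; the operators follow Figure 1 on dictionary-based
region sets.

```python
def containing(r1, r2):
    out = {}
    for (s1, e1), sc1 in r1.items():
        hits = [sc2 * (e2 - s2) for (s2, e2), sc2 in r2.items()
                if s1 <= s2 and e2 <= e1]
        if hits:
            out[(s1, e1)] = sc1 * sum(hits) / (e1 - s1)
    return out

def contained_by(r1, r2):
    out = {}
    for (s1, e1), sc1 in r1.items():
        outer = [sc2 for (s2, e2), sc2 in r2.items() if s2 <= s1 and e1 <= e2]
        if outer:
            out[(s1, e1)] = sc1 * sum(outer)
    return out

def scale(f, r):
    return {k: f * v for k, v in r.items()}

def or_(r1, r2):
    out = dict(r1)
    for k, v in r2.items():
        out[k] = out.get(k, 0.0) + v
    return out

# Hypothetical speech transcript at positions 1..11 and its element hierarchy.
tokens = "we are knights who say ni ni ni shrubbery a shrubbery".split()
shot = {(1, 4): 1.0, (4, 7): 1.0, (7, 10): 1.0, (10, 12): 1.0}
scene = {(1, 7): 1.0, (7, 10): 1.0, (10, 12): 1.0}
video = {(1, 10): 1.0, (10, 12): 1.0}
root = {(1, 12): 1.0}

ni = {(i, i + 1): 1.0 for i, t in enumerate(tokens, start=1) if t == "ni"}

# Region expression (9). Elements with identical extents (a scene that consists of
# a single shot) are merged by OR, which CONTAINED_BY would have summed anyway.
mixture = or_(or_(scale(0.18, containing(root, ni)),
                  scale(0.02, containing(video, ni))),
              or_(scale(0.4, containing(scene, ni)),
                  scale(0.4, containing(shot, ni))))
ranked = contained_by(shot, mixture)

def p(extent):
    """Relative frequency of "ni" inside an extent."""
    s, e = extent
    return tokens[s - 1:e - 1].count("ni") / (e - s)

def parent(level, extent):
    """The unique region of a level that encloses the given shot."""
    return next(r for r in level if r[0] <= extent[0] and extent[1] <= r[1])

# Equation (8), computed directly for every shot.
direct = {sh: 0.18 * p((1, 12)) + 0.02 * p(parent(video, sh))
              + 0.4 * p(parent(scene, sh)) + 0.4 * p(sh)
          for sh in shot}

for sh in sorted(ranked, key=ranked.get, reverse=True):
    print(sh, round(ranked[sh], 4), round(direct[sh], 4))
```

The shot that contains the term twice and lies in a scene full of the term is ranked first,
while a shot in an unrelated video still receives the collection-level score.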

3.4 Web retrieval with page priors 

For web retrieval, non-content information like the number of hyperlinks pointing to a web 
page, or the form of the URL are good indicators of the importance of a page. Such approaches 
can be modeled by so-called document priors P(D) that do not depend on the query [271 ISHl
[39]. Document priors are calculated once for the entire collection, stored in the system and
then used to enhance retrieval results for every query. A good example of such an approach
is Google's PageRank algorithm [lU].

Document priors are motivated as follows. Instead of ranking documents by the probability
that they generate the query, it makes more sense to rank them by
$P(D \mid T_1, T_2, \ldots, T_l)$: The probability that D is relevant given the query
$T_1, T_2, \ldots, T_l$ of length l. According to Bayes' rule:

$$P(D \mid T_1, T_2, \ldots, T_l) = \frac{P(D) \cdot P(T_1, T_2, \ldots, T_l \mid D)}{P(T_1, T_2, \ldots, T_l)} \propto P(D) \cdot \prod_{i=1}^{l} P(T_i \mid D) \tag{10}$$

The denominator, $P(T_1, T_2, \ldots, T_l)$, does not depend on D and can therefore be dropped,
but the document prior, $P(D)$, cannot be dropped unless it is uniformly distributed over all
documents.

^4 From the movie "Monty Python and the Holy Grail".

Suppose we are looking for the entry page of Google. Documents will be ranked
as follows.

$$P(D \mid T = \mathrm{google}) \propto P(D) \cdot P(T = \mathrm{google} \mid D) \tag{11}$$

To follow this approach, the system needs to have some means to store text elements with their 
prior probability. Suppose an application program calculated the PageRank of each crawled 
web page resulting in probabilities P(D) (or any number proportional to the probabilities,
see [lU]) for each document region, which is stored as $PageRank. The dollar sign is used to
denote a region set that is stored by the system for later use. The set is used in the query as 
follows. 

$PageRank AND (<doc> CONTAINING google) (12) 

We believe the correspondence between Equations 11 and 12 is obvious. As before, the query
$PageRank AND (<doc> CONTAINED_BY (<doc> CONTAINING google)) would be a more general query
that produces the exact same results.
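A stored region set such as $PageRank is simply a set of document regions whose scores are
the priors. The sketch below illustrates Equation 12 under our own assumptions: the three toy
pages and the prior values are invented for the example and are not real PageRank scores, and
region sets are dictionaries from (start, end) to score with operators that follow Figure 1.

```python
def containing(r1, r2):
    out = {}
    for (s1, e1), sc1 in r1.items():
        hits = [sc2 * (e2 - s2) for (s2, e2), sc2 in r2.items()
                if s1 <= s2 and e2 <= e1]
        if hits:
            out[(s1, e1)] = sc1 * sum(hits) / (e1 - s1)
    return out

def and_(r1, r2):
    return {k: r1[k] * r2[k] for k in r1.keys() & r2.keys()}

# Three hypothetical pages; the first two mention "google" equally often
# relative to their length, the third does not mention it at all.
docs = [["google", "search", "home"],
        ["about", "google", "the", "google", "story", "blog"],
        ["weather"]]
tokens = [t for d in docs for t in d]
extents, start = [], 1
for d in docs:
    extents.append((start, start + len(d)))
    start += len(d)
doc_regions = {extent: 1.0 for extent in extents}

# Stand-in for $PageRank: one stored score per document region (made-up priors).
page_rank = dict(zip(extents, [0.6, 0.1, 0.3]))

google = {(i, i + 1): 1.0 for i, t in enumerate(tokens, start=1) if t == "google"}

content_only = containing(doc_regions, google)  # P(T = google | D)
ranked = and_(page_rank, content_only)          # region expression (12)

print(content_only)
print(ranked)
```

On content alone the first two pages tie; the prior breaks the tie in favour of the first
page, and the third page is not retrieved regardless of its prior.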

3.5 Cross-language information retrieval 

In cross-language information retrieval, a collection in one language, e.g. English, is searched
by querying it in another language, e.g. Dutch. A language modeling approach to
cross-language retrieval ranks documents by the probability $P(S_1, S_2, \ldots, S_l \mid D)$ of
generating a Dutch query $S_1, S_2, \ldots, S_l$ of length l from the English document D. This is
modeled by the following procedure: first an English word T is generated from a document with
probability $P(T \mid D)$, then the English term is translated to Dutch independently from the
document it was generated from, so with probability $P(S \mid T)$, resulting in [71 [231 156]:

$$P(S_1, S_2, \ldots, S_l \mid D) = \prod_{i=1}^{l} \sum_{j=1}^{V} P(S_i \mid T_j) \, P(T_j \mid D) \tag{13}$$

where $P(T_j \mid D)$ is again the document language model, and $P(S_i \mid T_j)$ is a translation
model defining the probabilities of the source language words (for instance Dutch in case of a
Dutch query) given the target language words (English if the collection being searched is
English), and where V is the size of the target language vocabulary. Such a model is used as
follows: Given a Dutch query $S_1, S_2, \ldots, S_l$, every word might have several possible
translations in English. Suppose we want to use the Dutch query gebroken hart (English: "broken
heart") to search for English documents. The application program would consult its dictionary to
determine that there are two possible English translations for the Dutch word "gebroken":
"broken" and "fractured". The probability of translating "broken" to "gebroken", i.e.
$P(S = \mathrm{gebroken} \mid T = \mathrm{broken})$, might be estimated as 1.0, for instance because
from example texts we know that the English word "broken" is always translated to "gebroken";
and the probability of translating "fractured" to "gebroken", i.e.
$P(S = \mathrm{gebroken} \mid T = \mathrm{fractured})$, might be estimated as 0.2 (note that the two
probabilities do not need to sum up to 1). In this case, an instantiation of Equation 13
would be:

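$$\begin{aligned}
P(S_1 = \mathrm{gebroken}, S_2 = \mathrm{hart} \mid D) = {} & \big(1.0 \cdot P(T_1 = \mathrm{broken} \mid D) + 0.2 \cdot P(T_1 = \mathrm{fractured} \mid D)\big) \\
& \cdot \big(0.5 \cdot P(T_2 = \mathrm{heart} \mid D) + 0.1 \cdot P(T_2 = \mathrm{ticker} \mid D)\big)
\end{aligned} \tag{14}$$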
So, the sum over the whole target language vocabulary will in practice be a sum over the
possible translations only (those for which $P(S \mid T) > 0$). The probability function
corresponds to the following region expression.

((1.0 SCALE (<doc> CONTAINING broken)) OR (0.2 SCALE (<doc> CONTAINING fractured))) 

AND (15) 

((0.5 SCALE (<doc> CONTAINING heart)) OR (0.1 SCALE (<doc> CONTAINING ticker))) 

Equation 15 can be generated from Equation 14 as shown in the previous sections.
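The rewriting can again be illustrated on a toy example. The sketch below builds the region
expression of Equation 15 from a translation table and compares it with a direct computation
of the translation model. The translation probabilities are those of the running example; the
three English documents, the dictionary-based region sets and all function names are
assumptions made for illustration, with operators that follow Figure 1.

```python
def containing(r1, r2):
    out = {}
    for (s1, e1), sc1 in r1.items():
        hits = [sc2 * (e2 - s2) for (s2, e2), sc2 in r2.items()
                if s1 <= s2 and e2 <= e1]
        if hits:
            out[(s1, e1)] = sc1 * sum(hits) / (e1 - s1)
    return out

def scale(f, r):
    return {k: f * v for k, v in r.items()}

def and_(r1, r2):
    return {k: r1[k] * r2[k] for k in r1.keys() & r2.keys()}

def or_(r1, r2):
    out = dict(r1)
    for k, v in r2.items():
        out[k] = out.get(k, 0.0) + v
    return out

# P(S | T) for the running example: Dutch source word -> {English word: probability}.
translations = {"gebroken": {"broken": 1.0, "fractured": 0.2},
                "hart": {"heart": 0.5, "ticker": 0.1}}
query = ["gebroken", "hart"]

# Hypothetical English collection; the third document has no translation of "hart".
docs = [["my", "broken", "heart"],
        ["a", "fractured", "ticker", "valve"],
        ["broken", "window"]]
tokens = [t for d in docs for t in d]
extents, start = [], 1
for d in docs:
    extents.append((start, start + len(d)))
    start += len(d)
doc_regions = {extent: 1.0 for extent in extents}

def word(term):
    return {(i, i + 1): 1.0 for i, t in enumerate(tokens, start=1) if t == term}

def translated(source_word):
    """OR over all translations t of: P(S|t) SCALE (<doc> CONTAINING t)."""
    result = {}
    for target, prob in translations[source_word].items():
        result = or_(result, scale(prob, containing(doc_regions, word(target))))
    return result

# Region expression (15).
ranked = and_(translated("gebroken"), translated("hart"))

def direct(d):
    """The translation model restricted to the known translations."""
    score = 1.0
    for source_word in query:
        score *= sum(prob * d.count(target) / len(d)
                     for target, prob in translations[source_word].items())
    return score

for extent, d in zip(extents, docs):
    print(extent, ranked.get(extent, 0.0), direct(d))
```

Note that Equation 15 uses no smoothing, so the third document, which matches only one of the
two query words, is not retrieved.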



4 Discussion, open issues and future work 

In this report, we presented a unified region model / language model approach and showed 
its expressiveness for a wide range of applications of language modeling: ad-hoc retrieval, 
smoothing, video retrieval, web search and cross-language retrieval. In the past, we have 
developed separate prototype retrieval systems for these approaches. Developing these pro- 
totype systems meant we had to reimplement parts of our system: reimplementing APIs, 
introducing new APIs, introducing new query languages, introducing new indexes, introduc- 
ing new storage structures, etc. This report shows that such approaches can be supported by 
a single retrieval system that responds to a simple query language that hides implementation 
details of information retrieval approaches from the application developer. 

The relationship between the region queries and the language modeling probability func- 
tions might seem trivial because we "hard-wired" the language modeling probability definition 
in the CONTAINING operator, but we believe it is remarkable: Note that the language mod- 
eling probability functions are arithmetic expressions that define the probability of a single 
document D. However, the region queries are algebraic expressions for processing sets of 
documents (regions) instead of single documents. Since the region query language forms a 
"bulk algebra", experiences from relational database system design can be used to develop 
efficient implementations of such a system, possibly up to a point where applications run as 
fast as, or possibly even faster than, the dedicated prototypes we developed in the past. 

4.1 Query optimization 

The queries presented in Section 2 are close to the language modeling probability functions.
However, there exist alternative expressions of the queries that produce equivalent results
but that might be easier for the system to process. Based on a study into equivalence
relations for region models fB3|, we conjecture that the following expressions are alternatives for
the expressions presented in Section 2:

(<doc> CONTAINING db) CONTAINING ir
is an alternative for Equation El;

(<doc> CONTAINED_BY (((0.2 SCALE <root>) OR (0.8 SCALE <doc>)) CONTAINING db))
CONTAINED_BY (((0.2 SCALE <root>) OR (0.8 SCALE <doc>)) CONTAINING ir)
is an alternative for Equation El;

<shot> CONTAINED_BY (((0.18 SCALE <root>) OR (0.02 SCALE <video>) OR
(0.4 SCALE <scene>) OR (0.4 SCALE <shot>)) CONTAINING ni)
is an alternative for Equation El;

$PageRank CONTAINING google
is an alternative for Equation 12; and finally

(<doc> CONTAINING (broken OR (0.2 SCALE fractured))) CONTAINING ((0.5 SCALE heart) OR (0.1 SCALE ticker))
is an alternative for Equation 15.
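
The conjectured equivalence for Equation 15 can be tried out on a toy collection. The sketch below is not the implementation of our system: it assumes that CONTAINING multiplies the score of the containing region by the summed scores of the contained regions divided by the length of the containing region, that SCALE multiplies scores, that OR adds the scores of identical regions, and that AND multiplies them. These semantics are inferred from the correspondence between Equations 14 and 15; all function names and the two-document collection are hypothetical.

```python
from collections import namedtuple

Region = namedtuple("Region", "start end score")

def term(tokens, word):
    """Every occurrence of `word` as a one-position region with score 1."""
    return [Region(i, i, 1.0) for i, t in enumerate(tokens) if t == word]

def scale(factor, regions):
    return [r._replace(score=factor * r.score) for r in regions]

def or_(a, b):
    """Union of two region sets; identical regions get the sum of their scores."""
    scores = {}
    for r in a + b:
        scores[(r.start, r.end)] = scores.get((r.start, r.end), 0.0) + r.score
    return [Region(s, e, score) for (s, e), score in sorted(scores.items())]

def and_(a, b):
    """Regions present in both sets; scores are multiplied."""
    b_scores = {(r.start, r.end): r.score for r in b}
    return [r._replace(score=r.score * b_scores[(r.start, r.end)])
            for r in a if (r.start, r.end) in b_scores]

def containing(outer, inner):
    """Weight each outer region by the contained inner score mass per position."""
    result = []
    for o in outer:
        mass = sum(i.score for i in inner if o.start <= i.start and i.end <= o.end)
        result.append(o._replace(score=o.score * mass / (o.end - o.start + 1)))
    return result

# Two hypothetical documents of five words each, stored as one token stream.
tokens = "a broken heart is fractured fractured ticker and heart ache".split()
docs = [Region(0, 4, 1.0), Region(5, 9, 1.0)]
t = lambda word: term(tokens, word)

# Equation 15 as written.
original = and_(
    or_(scale(1.0, containing(docs, t("broken"))), scale(0.2, containing(docs, t("fractured")))),
    or_(scale(0.5, containing(docs, t("heart"))), scale(0.1, containing(docs, t("ticker")))),
)

# The conjectured alternative: scale and combine the term regions first.
alternative = containing(
    containing(docs, or_(t("broken"), scale(0.2, t("fractured")))),
    or_(scale(0.5, t("heart")), scale(0.1, t("ticker"))),
)

for a, b in zip(original, alternative):
    print(a, b)
```

Under these assumptions the alternative scans the document regions twice instead of four times, which illustrates why such a rewrite might be cheaper to process.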

Additionally, query optimization would involve choosing concrete evaluation methods
attached to each operation, estimating the costs of each method, and choosing the fastest plan.
Ramirez and De Vries [44] present preliminary results.

4.2 Towards existing XML query languages 

It can be argued that region models are simple predecessors of models underlying XML
query languages like XPath [8] and XQuery [9]. That is, operators like CONTAINED_BY and
CONTAINING can be seen as ancestor and descendant axis steps, as well as the function
fn:contains in XPath. It would be relatively easy to add other XPath axis steps to the query
language if we specify how regions are nested, for instance by requiring that a region has a
level (the depth in the XML tree) as well as a start, end, and score.
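
To illustrate why a level helps, the following minimal sketch adds a `level` field to a region and uses it to tell child steps from descendant steps. The class and function names are hypothetical, and the containment test is an assumption: it allows a nested element to share the start and end positions of its parent, in which case only the level distinguishes the two.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Region:
    name: str
    start: int
    end: int
    level: int          # depth in the XML tree
    score: float = 1.0

def is_ancestor(a, d):
    """a encloses d and sits higher in the tree."""
    return a.start <= d.start and d.end <= a.end and a.level < d.level

def descendants(context, candidates):
    return [c for c in candidates if is_ancestor(context, c)]

def children(context, candidates):
    """Descendants exactly one level below the context region."""
    return [c for c in descendants(context, candidates) if c.level == context.level + 1]

doc = Region("doc", 0, 9, level=1)
sec = Region("sec", 0, 4, level=2)
par_in_sec = Region("par", 0, 4, level=3)   # same positions as its parent <sec>
par = Region("par", 5, 9, level=2)
regions = [doc, sec, par_in_sec, par]

print([r.name for r in children(doc, regions)])     # child axis step
print([r.name for r in descendants(doc, regions)])  # descendant axis step
```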

XML and its subsequent standards like XPath and XQuery have initiated a lot of research
into XML database systems with dedicated workshops and symposia like DataX [33] and
XSym [6]. Our implementation of the region approach is quite similar to implementations
of XML databases that use relational database technology and a numbering of the XML
nodes [52]. Interestingly, the word positions that belong to the region start and region end
of an XML element are respectively in pre-order and post-order as in the XML database
implementation proposed by Grust [22]. Our prototype system TIJAH uses part of the code
of the PathFinder XML database system [53]. In the future, both systems might be integrated
following the XQuery full-text standard [2, 3].
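
The pre-order and post-order property can be observed on a small document. The sketch below uses Python's standard `xml.etree.ElementTree` parser and a hypothetical numbering in which every opening tag, word, and closing tag consumes one position; this numbering is an assumption made for illustration and need not match the one used in TIJAH.

```python
import xml.etree.ElementTree as ET

def number_regions(root):
    """Return (tag, start, end) triples; tags and words each take one position."""
    regions = []
    position = 0

    def visit(node):
        nonlocal position
        start = position
        position += 1                                   # opening tag
        position += len((node.text or "").split())      # words before the first child
        for child in node:
            visit(child)
            position += len((child.tail or "").split())  # words after the child
        end = position
        position += 1                                   # closing tag
        regions.append((node.tag, start, end))

    visit(root)
    return regions

root = ET.fromstring(
    "<doc><title>broken heart</title>"
    "<sec><par>some words</par><par>more words here</par></sec></doc>"
)
regions = number_regions(root)

by_start = [tag for tag, start, end in sorted(regions, key=lambda r: r[1])]
by_end = [tag for tag, start, end in sorted(regions, key=lambda r: r[2])]
print(by_start)  # pre-order: doc, title, sec, par, par
print(by_end)    # post-order: title, par, par, sec, doc
```

Sorting on the start position reproduces document order (pre-order), and sorting on the end position reproduces post-order, so the two columns of a region table can play the role of a pre/post numbering.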

4.3 Towards new applications of XML 

Some people have argued that existing XML query languages like XPath [8] and XQuery
[9] are too powerful for simple XML information retrieval functionality [S3]. Others have
argued that existing query languages are not powerful enough. For instance Ogilvie [38] 
illustrates a system that answers queries like "Who killed Abraham Lincoln" by a query that 
returns those <person> elements that directly precede the word killed, which directly precedes 
another <person> element containing lincoln. Such a query would be hard, if not impossible, 
to express in existing XML query languages. A solution might be the introduction of a special 
gluing operator in our region model approach, let's call it ADJ for "adjacent", which can glue 
regions to form bigger regions. Such an operator might be used for phrases, but also to glue 
for instance two paragraphs together to form a region that spans two paragraphs. We have 
implemented such a gluing operator in our video retrieval system that, lacking a reliable scene 
detector, glues adjacent shots together to represent a scene [25] . 
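
A gluing operator of this kind could be sketched as follows. This is a minimal illustration, not the operator of our video retrieval system: it assumes that two regions are adjacent when the second starts at the position directly after the end of the first, and that the score of the glued region is the product of the two scores. Both choices, as well as the name `adj`, are hypothetical.

```python
from collections import namedtuple

Region = namedtuple("Region", "start end score")

def adj(left, right):
    """Glue each left region to every right region that starts right after it ends."""
    by_start = {}
    for r in right:
        by_start.setdefault(r.start, []).append(r)
    return [Region(l.start, r.end, l.score * r.score)
            for l in left
            for r in by_start.get(l.end + 1, [])]

# Three hypothetical consecutive shots, glued pairwise into scene-like regions.
shots = [Region(0, 9, 1.0), Region(10, 24, 1.0), Region(25, 30, 1.0)]
scenes = adj(shots, shots)
print(scenes)
```

Because the result is again a set of regions, it can be used as an operand of CONTAINING or CONTAINED_BY like any other region set.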

4.4 Beyond XML 

Ogilvie [38] also makes a case for allowing several hierarchies of possibly overlapping elements 
which combined would no longer form a tree. This need is illustrated as well by Burkowski
[11], by people studying the Bible [17], and it is picked up by several initiatives to extend XML
[32][51]. The region approach described here would support querying of such representations
quite naturally. 
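
As a small illustration of why overlapping hierarchies pose no problem for regions, consider the sketch below. The two hierarchies (pages and sentences), the positions, and the function names are hypothetical; the only assumption is that containment is decided from start and end positions alone.

```python
from collections import namedtuple

Region = namedtuple("Region", "name start end")

def contains(outer, inner):
    return outer.start <= inner.start and inner.end <= outer.end

def overlaps(a, b):
    return a.start <= b.end and b.start <= a.end

# Two hierarchies over the same text: sentence s2 crosses the page boundary.
pages = [Region("page1", 0, 49), Region("page2", 50, 99)]
sentences = [Region("s1", 0, 39), Region("s2", 40, 59), Region("s3", 60, 99)]
hit = Region("word", 52, 52)   # position of a query term

# The same containment test answers queries against either hierarchy.
matches = [r.name for r in pages + sentences if contains(r, hit)]
print(matches)
```

Here `s2` overlaps both pages without being contained by either, so the combined elements do not form a tree, yet the query for regions containing the term is answered for pages and sentences alike.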